Skip to content

Conversation

@piergm
Copy link
Member

@piergm piergm commented Oct 15, 2025

With this PR we are introducing a new Transport level parameter to include resolutions for each original expression. This parameter can't be set from the REST layer and that do not affect the REST API response.
The parameter name is includeResolvedTo, that defaults to false (current behaviour) and if set to true will populate the fieldCaps response resolvedLocally and possibly resolvedRemotely (the latter only in CCS/CPS context).

resolvedLocally is of type ResolvedIndexExpressions that contains a list of ResolvedIndexExpression. The ResolvedIndexExpression list has one entry for each original index expression passed to FC and each element of the list will contain the original (unchanged) expression passed to FC, what it was resolved to locally (concrete indices also with the result of the resolution: SUCCESS/UNAUTH/NOT_VISIBLE) and, if in CPS context, also a list of remote expressions that the original expression was resolved to (flat logs* will have remote expressions p1:logs*and p2:logs* given that p1 and p2 are linked project that are not projectRouting excluded).

resolvedRemotely is only populated in CCS/CPS context and contains a map form the remote/project alias to the ResolvedIndexExpressions for that remote/project.
Note: resolvedLocally and resolvedRemotely.get(projectAlias) are both of type ResolvedIndexExpressions but the latter can't contain remoteExpression (because we are not allowing CCS/CPS chaining) and the original expression will contain the expression that the remote project received, therefore if the coordinator want to query p1:logs*, the p1 response will contain original="logs" and that ResolvedIndexExpression will be stored into the resolvedRemotely at key p1.
Note also that resolvedRemotely is transient and never sent through the wire between the remotes/linked project for the reasons I listed above. If in the future we want to enable CPS changing we can remove this limitation.

The decision to re-use ResolvedIndexExpression and not to create another ad-hoc type without the remote expression was done because we want to avoid the proliferation of nearly-identical records.

Code written together with @alexey-ivanov-es 😄 🙏

listener.onResponse(FieldCapabilitiesResponse.builder().withFailures(failures).build());
} else {
// throw back the first exception
listener.onFailure(failures.get(0).getException());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is a bit tricky for CCS/CPS index resolution. As commented it literally re-throws back the first exception as is it was reported by remote.

For example for FieldCapsRequest(indices=[r1:index1,r2:index1,r3:index1]) (note index names are the same, only remotes are different), it could throw something like: org.elasticsearch.transport.RemoteTransportException with a cause org.elasticsearch.index.IndexNotFoundException: no such index [index1]. Unfortunately this lacks the information about remote originally caused the issue even though we know it from failures.get(0).getIndices() (although it might be caused my multiple indices).

This information is very important for ES|QL (and for other usecases) in order to properly report index resolution issues.

I believe there are 2 ways to resolve it:

  • introduce a flag that causes FC to always resolve listener nonExceptionally with list of result and failures (basically first branch here). This way each caller might deduce the root cause of the issue on top of all exceptions and construct corresponding error message.
  • introduce a flag that causes FC to wrap failures into new exception (maybe IndexResolutionException) retaining the index information and summarizing the errors. This way error reporting/formatting logic could be the same for all callers.

I believe this should be done separately. Let me open a draft/proposal with such change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@idegtiarenko created a pr to address this: #136680


private final String[] indices;
private final ResolvedIndexExpressions resolvedLocally;
private final transient Map<String, ResolvedIndexExpressions> resolvedRemotely;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why resolvedLocally are serialized but resolvedRemotely not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the "coordinating" node can populate resolvedRemotely, because it knows from which remote the response was from.
Also, since chaining is disabled each FieldCapsResponse can only have the locally resolved index expressions as of now.

So. If the coordinating node receives the following index expression (in ccs notation) local-index,r*:remote-index
The coordinating node will populate resolvedLocally and forward the "r*:remote-index" to all matching remotes/projects.
The remotes/projects will then "resolveLocally" and respond back to the coordinating node. So for the remotes/projects we will never have the resolvedRemotely section populated.

When the coordinator receives back the responses from the remotes will then populate its local resolvedRemotely based on the responses it gets back.
resolvedRemotely then can be used either by ES|QL code to get infos on which original index expression what resolved to what and where and/or by CPS error handling code to possibly throw errors.

Sounds reasonable?
LMK if you have any other further questions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify - is the coordinator node always the same node that sent the original FieldCapabilitiesRequest?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the coordinator node always the same node that sent the original FieldCapabilitiesRequest?

Yes, the coordinator node is the node receiving the rest FC request, that node will then forward the requests to other data nodes and/or to other clusters as needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this answer is correct, but it's worth triple checking: there are situation in CCS where requests are proxyed, hence read from a node and written directly to another one.

Yes, the coordinator node is the node receiving the rest FC request

Agreed, that's correct. There are also cases where the call is not originated from REST, and the transport field caps action is called directly via a client. With that said, the node that coordinates the field caps execution is one and only one, across multiple clusters/projects, correct? (This is slightly different in search with minimize roundtrips for instance).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way: I find the distinction between resolvedLocally and resolvedRemotely hard to follow. Maybe it's me. Isn't the difference between the two the key in the data structure - original expression vs remote alias - ? Could you add a full example of a real life situation and how the two would be populated and how the info is used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the node that coordinates the field caps execution is one and only one, across multiple clusters/projects, correct?

Yes, that is correct, no matter if an internal user (ES|QL) calls the transport action directly or if the request goes through the REST layer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the distinction between resolvedLocally and resolvedRemotely makes the most sense in CPS. To further explain it immagine this: we have three projects, origin with logs-es, p1 with metrics-es and p2 empty.
The client asks for two flat expressions logs* and metrics*.
The resolvedLocally will have the following structure (populated before contacting the linked projects):

for logs*
original: logs*
localExpressions: [{logs-es, SUCESS}]
remoteExpressions: [p1:logs*, p2:logs*]

metrics*
original: metrics*
localExpressions: [{NOT_VISIBLE}]
remoteExpressions: [p1:metrics*, p2:metrics*]

Then we send the request to both p1 and p2 and we get back:

from p1:
original: logs*
localExpressions: [{NOT_VISIBLE}]
original: metrics*
localExpressions: [{metrics-es, SUCESS}]

and from p2:
original: logs*
localExpressions: [{NOT_VISIBLE}]
original: metrics*
localExpressions: [{NOT_VISIBLE}]

Now we need to check if we have to throw errors based on the flat world resolution. For "logs*" we found it locally, so no error need to be thrown (and at this point in time we don't really care to check if it was found in linked projects).

For "metrics*" we check locally, we see "NOT_VISIBLE", therefore we loop on all the "remoteExpressions" inside the resolvedLocally and, split project alias, check that project alias inside resolvedRemotely and check if the index expression is present there with "SUCCESS", this is only the case for p1 and not p2.

Hope this make sense, I know it's a bit convoluted.
That said can it be done with only one map? Yes, but would be more confusing IMO. Plus by having the distinction we also "forcibly" disable project chaining, since we never send through the wire resolvedRemotely but only resolvedLocally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a little bit of the above as a code comment. I know this is just the response class, but at least a comment as to why resolvedRemotely is not serialized and what the distinction be resolvedRemotely and resolvedLocally is - that will help future maintainers with debugging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@piergm piergm marked this pull request as ready for review October 23, 2025 09:24
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Oct 23, 2025
@piergm piergm changed the title WIP Field caps transport changes to return for each original expression what it was resolved to Field caps transport changes to return for each original expression what it was resolved to Oct 23, 2025
@piergm piergm self-assigned this Oct 23, 2025
@piergm piergm added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Oct 23, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Oct 23, 2025
@idegtiarenko
Copy link
Contributor

I believe this does not properly support exclusions in the new structure yet.
I do not mind adding that as a followup, although I am surprised, the following fails:

    public void testExclusion() throws InterruptedException {
        assertAcked(
            prepareCreate("index-2024"),
            prepareCreate("index-2025")
        );

        prepareIndex("index-2024").setSource("timestamp", "2024", "f1", "1").get();
        prepareIndex("index-2025").setSource("timestamp", "2025", "f2", "2").get();

        var r = client().prepareFieldCaps("index-*", "-*2025").setFields("*").get();
        assertThat(List.of(r.getIndices()), contains("index-2024"));
    }

I believe the same passes on 9.1 or on main.

}
}

public void testResolvedToMatchingEverywhere() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be difficult (impossible?) in the IT test framework, but is there a way to test that cross-cluster chaining does not occur here? (We'd need to link a remoteB to our remote-cluster, but not the origin cluster).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could maybe connect origin -> remote -> origin to reproduce that sceario?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. CPS chaining is impossible because we are not overriding allowCrossProject thus with this PR the endpoint will not be CPS compatible.
  2. In this repo we are not able to test the project linking scenario outlined in the comment.

This is though a valid point. I have a change ready to 1) allowCrossProject and 2) test the scenario.

As a side note: With the resolvedLocally/resolvedRemotely data structure distinction and since we are only sending through the wire resolvedLocally for each remote/project we effectively prevent chaining (at least prevent it from surfacing it to the user, actual prevention needs to be done via changing the indicesOptions resolveCrossProject to false when we send the request to the linked projects)

index,
new ResolvedIndexExpression.LocalExpressions(
Set.of(),
ResolvedIndexExpression.LocalIndexResolutionResult.CONCRETE_RESOURCE_NOT_VISIBLE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(asking to learn) - I see that Nikolaj already added this LocalIndexResolutionResult enum, but what does CONCRETE_RESOURCE_NOT_VISIBLE mean? Not sure how that differs from CONCRETE_RESOURCE_UNAUTHORIZED.

And then for your code here, how do you know that based on it being a failure it is NOT_VISIBLE vs. UNAUTHORIZED?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had exactly the same question :) tuning in to read the answer on this thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was introduced to distinguish between an auth error or an IndexNotFound error. This distinction is needed in order to return 403 or 404 to the user and we need to know it also from the linked projects.
CONCRETE_RESOURCE_NOT_VISIBLE means that a non-wildcard expression was resolved to nothing, either because it doesn't exists or because it is closed.
CONCRETE_RESOURCE_UNAUTHORIZED means that the expression could be resolved to a concrete index but the requesting user has no authorization to access it.

NB: I can expand on why CONCRETE_RESOURCE_NOT_VISIBLE applies only to non-wildcard if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think the answer may be that we have to trust the security decisions that were already made, but isn't unauthorized already giving away the info that the index does exist? That's no longer a concern then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, earlier we discussed with @n1v0lg that we want to "hide" if an index exists and it's un auth or does not exists/is closed. This is no longer the case, we are already sending to the users in CCS 403 if un auth or 404 if not found and we want to keep this distinction also in CPS

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the answer may be that we have to trust the security decisions that were already made

Long story short yes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, can we add code comments for CONCRETE_RESOURCE_NOT_VISIBLE in LocalIndexResolutionResult to clarify that. Even reading the javadoc on that enum that's still not obvious/clear. Doesn't have to be in the PR, just a general request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment in LocalIndexResolutionResult to further explain it as suggested!

Copy link
Contributor

@idegtiarenko idegtiarenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be passing all the tests.
I am inclined to merge it into main to enable integration into ESQL while we are still working on remaining features (like properly support exclusions in the new data-structure).

@piergm piergm requested a review from quux00 October 29, 2025 14:02
Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor requests.

index,
new ResolvedIndexExpression.LocalExpressions(
Set.of(),
ResolvedIndexExpression.LocalIndexResolutionResult.CONCRETE_RESOURCE_NOT_VISIBLE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, can we add code comments for CONCRETE_RESOURCE_NOT_VISIBLE in LocalIndexResolutionResult to clarify that. Even reading the javadoc on that enum that's still not obvious/clear. Doesn't have to be in the PR, just a general request.


private final String[] indices;
private final ResolvedIndexExpressions resolvedLocally;
private final transient Map<String, ResolvedIndexExpressions> resolvedRemotely;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a little bit of the above as a code comment. I know this is just the response class, but at least a comment as to why resolvedRemotely is not serialized and what the distinction be resolvedRemotely and resolvedLocally is - that will help future maintainers with debugging.

@idegtiarenko idegtiarenko mentioned this pull request Oct 31, 2025
// Locally resolved index expressions for the current project.
// This is sent over the wire from linked projects to the coordinating project of a request to
// inform the coordinating project of the local resolution state (for each remote).
private final ResolvedIndexExpressions resolvedLocally;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish there was a better way to name / explain this . It may be me, but when I read local here, I don't understand whether it's local to the linked projects or to the coordinating project.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is tricky, because what's remote in one context is local in another context.
If you think it will be better I can specify that it's always local to the FieldCapsResponse creator. So if it's created in the coordinator it's local to the coordinator. if instead is created in one of the linked project is local to the linked project and remote wrt the coordinator. (thus saved in resolvedRemotely of the coord)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment updated, hopefully is clearer now

Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@piergm piergm merged commit 9473216 into elastic:main Nov 3, 2025
34 checks passed
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 4, 2025
BASE=ec0efafc2fdc49adffe4f58806ea71d28b7d9f49
HEAD=171872a6ca36f8594f0e7fa7ad2981aa23a8d1a2
Branch=main
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 6, 2025
BASE=ec0efafc2fdc49adffe4f58806ea71d28b7d9f49
HEAD=171872a6ca36f8594f0e7fa7ad2981aa23a8d1a2
Branch=main
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 7, 2025
BASE=ec0efafc2fdc49adffe4f58806ea71d28b7d9f49
HEAD=171872a6ca36f8594f0e7fa7ad2981aa23a8d1a2
Branch=main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants